Chunking Arabic Texts
نویسندگان
چکیده
Text parsing has always benefited from special attention since the first applications of natural language processing (NLP). The problem gets worse for the Arabic language because of its specific features that make it quite different and even more ambiguous than other natural languages when processed. In this paper, we discuss a new approach for chunking Arabic texts based on a combinatorial classification process. It is a modular chunker that identifies the chunk heads using a combinatorial binary classification before recognizing their types based on the parts-of-speech of the chunk heads, already identified. For the experimentation, we use over than 2300 words as training data. The evaluation of the chunker consists of two steps and gives results that we consider very satisfactory (average accuracy of 89,60% for the classification step and 80,46% for the full chunking process).
منابع مشابه
Chunking Using Conditional Random Fields in Korean Texts
We present a method of chunking in Korean texts using conditional random fields (CRFs), a recently introduced probabilistic model for labeling and segmenting sequence of data. In agglutinative languages such as Korean and Japanese, a rule-based chunking method is predominantly used for its simplicity and efficiency. A hybrid of a rule-based and machine learning method was also proposed to handl...
متن کاملتعیین مرز و نوع عبارات نحوی در متون فارسی
Text tokenization is the process of tokenizing text to meaningful tokens such as words, phrases, sentences, etc. Tokenization of syntactical phrases named as chunking is an important preprocessing needed in many applications such as machine translation information retrieval, text to speech, etc. In this paper chunking of Farsi texts is done using statistical and learning methods and the grammat...
متن کاملPlagiarism Detection and Document Chunking Methods
This paper describes the tests made on chunking methods used for plagiarism detection. The result of the tests makes it possible to decide on the best fitting chunking method for a given application. For example, overlapping word chunking is good for a grammar analyzer or for small databases, sentence chunking suits best for finding quoted texts, hashed breakpoint chunking is the fastest method...
متن کاملJapanese Unknown Word Identification by Character-based Chunking
We introduce a character-based chunking for unknown word identification in Japanese text. A major advantage of our method is an ability to detect low frequency unknown words of unrestricted character type patterns. The method is built upon SVM-based chunking, by use of character n-gram and surrounding context of n-best word segmentation candidates from statistical morphological analysis as feat...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2012